Distributed training #8

Open · wants to merge 6 commits into main
Conversation

@mjconnor (Collaborator)

No description provided.

@mjconnor requested a review from @akshay-anyscale on July 12, 2023 at 22:00.
@akshay-anyscale (Contributor)

LGTM - @matthewdeng to take a look

@mjconnor (Collaborator, Author)

@matthewdeng sorry, meant to request you. Please review and merge!


To run:
```bash
python pytorch.py
```

Contributor: Suggest having a slightly more descriptive name here, e.g. `train_torch_model.py`.

### Monitor
After launching the script, you can look at the Ray dashboard. It can be accessed from the Workspace home page and enables users to track things like CPU/GPU utilization, GPU memory usage, remote task statuses, and more!

![Dash](https://github.com/anyscale/templates/releases/download/media/workspacedash.png)
Contributor: This image is highlighting VSCode 😅

[See here for more extensive documentation on the dashboard.](https://docs.ray.io/en/latest/ray-observability/getting-started.html)

### Model Saving
The model will be saved in the Anyscale Artifact Store, which is automatically set up and configured with your Anyscale deployment.
Contributor: Can we point to Anyscale documentation here? I feel like this is introducing a new concept for something that should be simpler (a cloud storage bucket).

```bash
gsutil ls $ANYSCALE_ARTIFACT_STORAGE
```
Authentication is automatcially handled by default.
Contributor: Suggested change (typo fix):
- Authentication is automatcially handled by default.
+ Authentication is automatically handled by default.
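
Returning to the Model Saving section: for completeness, here is a hedged sketch (not code from this PR) of writing a trained model into that bucket, assuming `$ANYSCALE_ARTIFACT_STORAGE` holds an object-store URI and an fsspec backend (e.g. gcsfs or s3fs) is available in the cluster environment:

```python
# Hedged sketch: save a PyTorch model locally, then copy it into the bucket
# referenced by $ANYSCALE_ARTIFACT_STORAGE. The model and file names are illustrative.
import os

import fsspec
import torch

model = torch.nn.Linear(4, 2)  # stand-in for the trained model
local_path = "model.pt"
torch.save(model.state_dict(), local_path)

artifact_uri = os.environ["ANYSCALE_ARTIFACT_STORAGE"]  # e.g. gs://... or s3://...
fs, remote_root = fsspec.core.url_to_fs(artifact_uri)
fs.put(local_path, f"{remote_root}/model.pt")
```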

### Submit as Anyscale Production Job
From within your Anyscale Workspace, you can run your script as an Anyscale Job. This is useful if you want to run a long-running job in production. Each Anyscale Job spins up its own cluster (with the same compute config and cluster environment as the Workspace) and runs the script. The Anyscale Job automatically retries in the event of failure and provides monitoring via the Ray Dashboard and Grafana.

To submit as a Production Job you can run:
Contributor: Could we have consistency in the naming? We use "Anyscale Production Job", "Anyscale Job", and "Production Job" here - it may not be obvious to the user that all three of these are meant to be the same thing 😄

for _ in range(epochs):
    train_epoch(train_dataloader, model, loss_fn, optimizer)
    loss = validate_epoch(test_dataloader, model, loss_fn)
    session.report(dict(loss=loss))

Contributor: We should save a checkpoint here 😄
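
A minimal sketch of what the suggested checkpointing could look like, assuming the script uses the Ray AIR `session`/`Checkpoint` API that the `session.report` call above implies; the surrounding names come from the quoted snippet, and this is illustrative rather than the PR's actual change:

```python
# Hedged sketch: report a checkpoint together with the validation loss so Ray Train
# persists the model state every epoch. Assumes ray.air.session / ray.air.checkpoint.
from ray.air import session
from ray.air.checkpoint import Checkpoint

for _ in range(epochs):
    train_epoch(train_dataloader, model, loss_fn, optimizer)
    loss = validate_epoch(test_dataloader, model, loss_fn)
    checkpoint = Checkpoint.from_dict(
        {"model_state_dict": model.state_dict()}  # model is the torch.nn.Module being trained
    )
    session.report(dict(loss=loss), checkpoint=checkpoint)
```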

Comment on lines +147 to +152
parser.add_argument(
    "--smoke-test",
    action="store_true",
    default=False,
    help="Finish quickly for testing.",
)

Contributor: This isn't used.
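
One hypothetical way the flag could actually be used (not code from the PR) is to shrink the run when `--smoke-test` is passed:

```python
# Hypothetical sketch of wiring up --smoke-test: run a single short epoch on a
# single worker so the script can be exercised quickly in CI.
args = parser.parse_args()

epochs = 1 if args.smoke_test else 10        # 10 is an illustrative full-run value
num_workers = 1 if args.smoke_test else 2    # 2 matches the distributed-training example
```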

Comment on lines +7 to +8
min_workers: 1
max_workers: 3
Contributor: I don't think these values really make sense with the current training script. If we are showing distributed training with 2 GPUs, I think we should either have min_workers be 2 (to make the script run immediately) or 1 (if we want to show autoscaling).
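
For context, a hedged sketch of the relationship the reviewer is describing, assuming the script launches Ray Train's `TorchTrainer` with 2 GPU workers (the function name is illustrative): with `num_workers=2`, the cluster needs enough workers available to satisfy that request before training starts.

```python
# Hedged sketch: a TorchTrainer that requests 2 GPU workers. The compute config's
# min_workers/max_workers must be able to satisfy this scaling request.
from ray.air.config import ScalingConfig
from ray.train.torch import TorchTrainer


def train_func():
    # placeholder for the PR's per-worker loop (train_epoch / validate_epoch / session.report)
    pass


trainer = TorchTrainer(
    train_loop_per_worker=train_func,
    scaling_config=ScalingConfig(num_workers=2, use_gpu=True),
)
result = trainer.fit()
```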

Comment on lines +144 to +146
parser.add_argument(
    "--use-gpu", action="store_true", default=True, help="Enables GPU training"
)

Contributor: I think this makes it so that this value is always true?
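
A possible fix (a sketch, not the author's change): either default the flag to `False`, or use a paired on/off flag via `argparse.BooleanOptionalAction` (Python 3.9+) so GPU training can actually be disabled:

```python
import argparse

parser = argparse.ArgumentParser()

# Option A: keep store_true but default to False, so passing --use-gpu actually toggles it.
# parser.add_argument("--use-gpu", action="store_true", default=False,
#                     help="Enables GPU training")

# Option B (Python 3.9+): generates both --use-gpu and --no-use-gpu, keeping GPU
# training on by default while still allowing it to be turned off.
parser.add_argument(
    "--use-gpu",
    action=argparse.BooleanOptionalAction,
    default=True,
    help="Enables GPU training",
)

args = parser.parse_args(["--no-use-gpu"])  # example invocation
print(args.use_gpu)  # -> False
```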
